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ABSTRACT 



A graphical test bed in which the results of a simulation experi- 
ment can be reported and analyzed is described. The test bed is based 
on the regression adjusted graphics and estimation methodology devel- 
oped by Heidelberger and Lewis for regenerative simulation. From the 
graphics and associated numerics, the experimenter can summarize and 
see simultaneously relative properties, such as bias, normality and 
standard deviation, of several estimators of a characteristic of a pop- 
ulation for up to 8 sample sizes. The evolution of these properties 
with sample size is also displayed. The graphics is supported on a 
line printer to make it and the program portable. The technique is 
illustrated by two examples, one concerning the effects of changes in 
data distribution on the behavior of the estimated lag one serial cor- 
relation coefficient and the other concerning the relative properties 
of several estimators of a Gamma distribution. 



1.0 Introduction 



SIMTBED is a graphical display program that can be used via a simu- 
lation on a digital computer to (i) explore the distribution of a statis- 
tical estimator for a given sample size, (ii) to compare the properties 
of that distribution when the estimator is calculated for various sample 
sizes, and (iii) to contrast those properties under different estimation 
conditions. Those conditions are controlled by the experimenter but, 
most commonly, they will entail competing estimation procedures (e.g. 
maximum likelihood versus methods of moments, or jackknifed versus not 
jackknifed). The program is flexible enough to accommodate the imagina- 
tion of most users and, in one of the examples, we also consider the 
effects of changes in the underlying distribution of the data. 

One salient feature of the program is that it uses the same batch of 
simulated random variables (e.g. Normals) to explore the properties of 
all the estimators at various sample sizes. This is done for economy of 
coirputer time and could be important on slow computers; the price paid 
is that the analytical analysis provided by SIMTBED of its graphical 
output is performed on correlated samples. 

To use the program it is necessary only to define the optional input 
parameters, supply the simulated random variables, and provide the 
Fortran functions which, when passed the data and subsample size, 
transform (if desired) the data subsample and compute the desired 
statistics. SIMTBED itself will subdivide and feed the data properly 
into the functions, produce boxplots and summary statistics, and compute 
regressions for the mean and variance of each estimator based on inverse 
subsample size. Up to three estimators can be used with the option to 
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produce equally scaled graphs for all the statistics. 

The features of the program are more easily demonstrated by example 
rather than explanation and so we will proceed directly to two applica- 
tions. The first application refers back to a simulation study done by 
Cox (1966) looking at the behavior of the estimated first order serial 
correlation coefficient, Fisher's z-transform of the estimated correla- 
tion, and the 2-fold jackknifed estimate of the correlation for i.i.d. 
Normal ( 0, 1 ) ,, x 2 d) and Lognormal (0,1) data. The jackknife was origi- 
nally proposed by Quenouille (1948) for the purpose of removing bias from 
the correlation estimate. The second application considers the problem of 
estimating the shape parameters for a highly skewed Garma(.25) and a 
nearly Norma 1 Gamma (5.0) sample using m.l.e., method of moments, 4-fold 
jackknifed m.l.e., and 4-fold jackknifed method of moments as the corm- 
peting estimators. 

Technical details concerning the SIMTBED software, not essential 
to interpreting and appreciating the output, can be found in Linnebur 
(1982), and an application to the analysis of output in a regenerative 
simulation can be found in Heidelberger and Lewis (1981). 

2.0 Calculation of the First Serial Correlation Coefficient 

It is kncwn that for an independent sample from a population with 
finite variance, the distribution of the serial correlation coefficient 
(Anderson and Walker, 1964) is asymptotically Normal with mean zero and 
variances 1/n, where n is the sample size. If the population is i.i.d 
Normal then the bias is exactly -1/n. Since those asymptotic properties 
are frequently used as approximations in tests of significance, it is 
important to knew hew valid the approximation would be in small samples 
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from a variety of distributions. We will look at that question in the 
next two sections and then yo on to consider two alternative measures of 
correlation, Fisher's z-transform and the 2-fold jackknifed estimate of 
the correlation. Their ability to reduce bias and/or induce Normality 
will be examined against other changes in the distribution of the 
estimators, particularly variance inflation. A simulation study, without 
graphics, of sane of these problems was conducted by Cox (1966). He did 
not consider the jackknifed estimate. 



2.1 SIMTBED Output for Serial Correlation 

Figure 1(a) shows the simulated distribution and sample properties 
of the serial correlation coefficient estimate 



r = 



n 



n-1 

n l (X- X. ) (X . , - X ) 
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(n-1) l (X - X ) 2 
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where X. = £ X./n , X = £ X./(n-l) , and X = £ X /(n-1) 

U j=l 3 1 j=l 3 n j=2 3 



for various sub-sample sizes n=rr. This definition matches that 
used by Anderson and Walker (1964). We consider first subsamples of 

size n^ = 10, and then of size n^ = 20, n^ = 30, n^ = 40, n^ = 50, 

r\ = 75, n_ = 100 and n 0 = 150, successively. For each subsample size 
the input sample of N = 5000 simulated Normal (0,1) random variables is 

divided into as many full subsanples of size n^ as possible, and the 
serial correlation is computed for each of the |N/nJ subsanples of size 
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n. 

1 



. The entire procedure is then replicated M = 10 tines, each time 



with a new simulated sample o£ N = 5000 Normal (0,1) variables. 

After all M replications have been run, all the estimates of serial 
correlation for each subsample size are grouped together and their 
simulated distribution is presented via a boxplot and summary statistics 
(see e.g. Fig. 1(a)). The boxplot follows the standards discussed in 
Mosteller and Tukey (1077) with the median denoted by a + within the box, 
the mean by a * within the box, the outliers by 0's, and the far outliers 
by *'s beyond the whiskers. The sutrmary statistics include the sanple 
mean, sample standard deviation, estimated standard deviation of the 
sample mean (i.e. sample standard deviat vl/nd), sanple skew- 



ness and sample kurtosis of the correlation estimates. 

Looking at the output, the first (leftmost) boxplot in the graph in 
Figure 1(a) shows the distribution of 



estimates of serial correlation from independent subsarrples of size 
n^ = 10. Suitmary statistics for the boxplot can be found below the graph 
in the column labeled "Subsanple Size 10", so that the average serial 
correlation is -.1074, and the estimated standard deviation is .2996. 
The estimated standard deviation of the serial correlation estimate is 
,2996/V(5000) = .00424. Recall that this refers to correlation 

estimates based on subsanples of size 10. 

Since the X-axis of the graph represents subsample size, the last 
(rightmost) boxplot shows the distribution of 




(# Replications) x ^ Sample 9 ^lze ^ 

(Subsanple Size) 



10 x 5000 = 10 x 500 = 5000 



10 



5000 

10 x 150 



10 x 33 = 330 
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estimates of serial correlation from independent subsamples of size 
n^ = 150. Although the 330 estimates are independent of each other, they 
are not independent of the 5000 estimates that corrprise the first boxplot 
since the sane data (divided and processed in different ways) was used for 
both. Suirtnary statistics shew that the average correlation has dropped to 
-.007372, indicating the fall off in bias, and the standard deviation has 
dropped to .07822, indicating the greater precision with which the correl- 
ation can be estimated when 150 points, rather than 10, are available. 

In' order to quantify the changes that are occurring in the mean and 
variance of the distribution of the estimator as subsample size changes, 
SIMTBED performs two types of regressions. The first regression is on 
the averages and is done after each replication, using the average serial 
correlation for that replication, r , as the dependent variable. 

l 

Inverse powers of the subsample size serve as the independent variables. 
For Figure 1(a) the degree of the regression was chosen to be D=3 so, 
for each replication, the equations we attenrpt to fit by least squares 
are: 



n. 

l 



- a, 



+ a l 



+ a 2 



+ a 3 



for i = 1,2. .. .8. 



n . 

l 



n . 



3 

n. 

i 



This form anticipates the general asyrptotic expansion 

a, . a„ a. 



E (0(n) ) = 0 + _1_ + _2 + _3 

n 2 3 



n n 

which holds true in the current situation with 0=0 and (in the Normal 
case) a = -1 (see Cramer, 1948, for general results of this type). 

Values of a^, a^, a and a^ are calculated after each repli- 
cation, averaged across the M replications to get a ( ^, a-^, a 2 , and a^, 

and then the averages are reported below the summary statistics on the 
line "Mean of Regression on Averages - Coefficients". We find that 
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= -.000272 and a^ = -1.03074, both close to their respective 
theoretical counterparts of zero and - 1 . 

Because we have 10 replications and therefore 10 independent values 
of each of a^, a^, a^, and a^, we can also estimate the variances and 



standard deviations of a^, a^, a r ^, and a^ across replications. 

These values are presented on the two lines immediately belcw the 
coefficients. For instance, the estimated s.d. of the estimate 
a Q = -.000272 of a 0 is .003892. 

The regression line for the mean value of the estimator is presented 
visually in the graph as a dotted curve. The estimated asyirptote 
(i.e. a^) is printed with a dashed line wherever it does not coincide 
with the regression line. Bias, therefore, can be viewed as the 
difference between those two lines. 

The second regression referred to above is done after all replica- 
tions have been run and the variances of the estimators at each subsarrple 
size have been calculated. (Note that the standard deviations, not the 
variances, are presented in the summary statistics. ) It should be 
recalled from previous discussion that these variances, as well as all 
measures in the summary statistics, are based on the grouping together of 
the serial correlations from all replications , at each subsarrple size. 
This is in contrast to the the procedure for the regression on the means, 
where average correlations are conputed for each subsarrple size for each 
replication . In the case of the variances, we have 8 equations: 



Var ( r ) = 
n. 






9 

5/2 

n. 



i = 1,2 8 , 



which we fit by least squares in order to estimate the coefficients 
Bq/ Bjy $ 2 ' and ^3 i n the presumed asymptotic expansion 
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Var (0 (n)) = 6 U + g l + B 2 + B 3 + 

n 3/2 2 5/2 

n n n 

This expansion holds for the variance of the estimated serial cor- 
relation coefficient for independent data. Usually it will be in 

which we are rrost interested since g^ is used in computing asyrrptotic 
relative efficiencies of estimators. For independent data with finite 
variance, we knew that g^= 1. The corrputed values of by b^, by 

and by are presented on the line labeled 'Regression on Variance - 
Coefficients'. Notice that b^ = .7438 is close to the theoretical 
value of 1. 

The final two numbers on Figure 1(a), YMIN and YMAX, sinply shew 
the scale of the vertical axis. Because the SIMTBED program option to 
put Figures 1(a), 1(b) and 1(c) on the same scale was in effect, it may 
be that no boxplot in a given Figure (eg. Figure 1(b)) requires the full 
range of Y- values. 

In order to produce Figure 1(b), the Normal (0,1) data that went 
into Figure 1(a) was squared to create longer tailed x 2 (D random 
variables. The output is entirely analogous to that for Figure 1(a). 
Similarly, for Figure 1(c), the Normal (0,1) data was exponentiated in 
order to create Lognormal(0, 1 ) data and to produce analogous graphical 
output. The indication is that the distribution of the sample serial 
correlation is robust with respect to the population distribution. 

The features of the SIMTBED output will become clearer when they are 
associated with the various properties of the correlation estimator. 
First, however, a few technical comments concerning the regressions are 
necessary. 

2.2 Sore Comments on the Regressions 

Two types of problems, numerical and statistical, can occur when 
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attempting to fit the two sets of regression equations presented in 
Section 2.1. 

First, there is the question of numerical stability when the 

. . , . . . f. -1 -2 -3, r -1 -3/2 -2 -5/2, 

independent variables, {1, n^ , n^ , n^ } or 1 n i ' n i ' n i ' n i * 

decrease geometrically. If we attenpt to form X T X, where X is the 
respective design matrix and X T is the transpose of X, we get values 

8 -6 

that range from 8 (assuming 8 subsample sizes) to £ n. 

i=l 1 

y n -2 to y n -5 

for the regression on the means, and i n i 

for the regression on the variances. Experience has shown that attempts 
to solve systems with such extremes in the xTx matrix produce erroneous 
results. Consequently, SIMTBED scales the design matrices by multiplying 
each entry of X by Max(n^) raised to the proper pcwer so that no 
entry becomes too small. The standard Choleski decorqponsition (see 
Dahlquist, Bjorack and Anderson, 1974) is then used to fit the equations, 
and the coefficients are properly rescaled before they are reported. 

This procedure produces numerically reliable results. 

The second problem concerns the breakdown of statistical assumptions 
in our regression models. It has already been pointed out in Secion 2.1 
that the two sets of dependent variables: 

(1) the O(n^) when considering the regression on the means; 



( 2 ) 



M|N/n.| 

the s 2 ( n . ) = ^ y 
1 3=1 



(6 . (n. ) 
D 1 



0 ( n , )) 2 



/ M |^N/n A 



where 0(n^) is the mean across the M replications, when 
considering the regression on the variances. 
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Table 1 



Entries in the table are the estimated correlations between the estimated 



variances of the 
for i=l, . . . ,8, j 


r at 

n . 

l 

1 / • * • / £3 « 


different subsarrple sizes: 


Corr (s 2 


<r n.>' 5 
1 


2 <v» 

J 


i 

j 


1 


2 


3 


4 


5 


6 


7 


8 


1 


1.00 


.49 


.46 


-.26 


.18 


-.17 


.14 


.01 


2 


.49 


1.00 


.40 


. 55 


.11 


.38 


.38 


-.03 


3 


.46 


.40 


1.00 


.23 


.23 


.44 


.21 


.29 


4 


-.26 


.55 


.23 


1.00 


.42 


.86 


.57 


.35 


5 


.18 


.11 


.23 


.42 


1.00 


.71 


.43 


.59 


6 


-.17 


.38 


.44 


.86 


.71 


1.00 


.45 


.53 


7 


.14 


.38 


.21 


.57 


.43 


.45 


1.00 


.72 


8 


.01 


-.03 


.29 


.35 


.59 


.53 


.72 


1.00 



Recall that r n is the estimated serial correlation for a simulated Normal(U,l) 
subsarrple of size n. Also, the estimated correlations shewn above were com- 
puted using 10 values (replications) of s 2 (r ) and s 2 (r ) for each i and j. 

n i n j 



Table 2 

A comparison of the estimated variance of s 2 (r n ) with the approximate 

i 

theoretical variance of s 2 (r n ) and with the approximately equivariant scaled 

i 

5 

versions, n.‘ s 2 (r ). All entries have been multiplied by 10 5 . 

l 



o 
1 — 1 

II 

• rH 

c 


20 


30 


40 


50 


75 


100 150 


Var (s 2 (r n )) .177 

i 


.150 


.204 


.079 


.047 


.031 


.049 .022 


Approx. Theoretical 


Var (s 2 (r n )) .400 

i 


.200 


.133 


.100 


.080 


.053 


.040 .027 


Var(n~* 5 s 2 (r n )) 1.77 


2.99 


6.12 


3.18 


2.33 


2.33 


4.88 3.35 


The estimated variances of s 2 


(r ) 
n . 

l 


and 


/n.s 2 

l 


(r ) 
n. 

l 


were 


ealeu la ted 



using 10 independent replications of s 2 (r n ). 
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are not independent over i since all are based on the same simulated 
data. The extent of the dependence is demonstrated by the correlation 
matrix in Table 1. Entries in that table shew the estimated correlation 
between s 2 (n^) and s 2 (n_.) for all i and j, where the estima- 
tion was done by repeating the SIMTBED experiment with 10 different 
batches of 50,000 simulated random variables. Since only 10 values went 
into each correlation calculation, the table is only accurate to within 
approximately ± 2// 10 = .632. We see sane indication of positive 
correlation, especially when i and j are close, but the lack of 
independence is not severe enough to hurt the regression results for 
either the estimated means or variances significantly. 

A second assunption, inplicit in any regression, is that the 
dependent variables have equal variances. This condition holds true for 
the means, which can be shewn to satisfy 

Var (0(n^ ) ) = | 



independently of i. The estimated variances, 

A 

equivariant and, if we assume the G^(n^) 
Normally distributed so that 



hewever, are not 
to be approximately 



M 



Lv /n il 



-I 

j=i 



( VV 



0(n. )) 2 



is approximately proportional to a Chi-squared random variable, with 
M I N/n.l - 1 degrees of freedom, we can conpute 



2 

Var (s 2 (n.)) = — . 

MNn.-n . 

l l 

To correct this problem of unequal variances SIMTBED scales the 
s 2 (n.) by /n. so that 

l l 
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Var (/n^ s 2 (n^ ) ) 



2 



MN - n. 
1 



2 

MN 



since MN » n^ . The design matrix is scaled accordingly and the values 

by, b-^, b^t and b^ discussed in Section 2.1 are reported. 

Table 2 shews the effects of the rescaling by presenting first the 

estimated variances of the s 2 (n^), where the estimation is done by 

repeating SIMTBED for 1U batches of 50, QUO simulated data points. These 

estimated variances decrease as n^ increases, closely paralleling the 

second line of Table 2 which has the approximate theoretical values 
2 

(i.e. 2/(MNn^- n^ ) ) . The final line of Table 2 shows the estimated 

variances of the rescaled s 2 (n.) , i.e. the /n. s 2 (n.), which, as 

i ii 

expected and hoped, shew a more constant variance with i. 

Although future versions of SIMTBED will include more sophisticated 
regression routines and the ability to generate independent samples at 
each subsample size, the current version is quick, usable, and accurate 
for mast situations. 



2.3 Interpreting the Serial Correlation Results 

Returning to Figure 1(a) which shews the simulated distribution of 
the serial correlation coefficient from independent, Normal(U,l) data, 
the following comments summarize the most striking features: 

(a) The boxplots appear very symmetric at all subsample sizes with nearly 
equal numbers of outliers at either tail and with mean and median coinci- 
dental. This observation is confirmed by the estimates of skewness in the 
summary statistics. Kurtosis is mildly negative at small subsample 
sizes but, overall, asymptotic Normality seems to take hold rather 
quickly. Note however, that at n^= 10 there are only 3 outliers in a 
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sample of size 5000. This is consistent with the estimated kurtosis of 
-0.424, shewing that the distribution is quite nonnormal. 

(b) The average serial correlation is negative for small subsamples. 

This is demonstrated by the dotted regression curve which starts at 
approximately -.10 and levels off near 0 for subsamples greater than 
about 85. The dashed asymptote of —.000272 is very close to the 
theoretical value of 0, and the mean values in the summary table closely 
reflect the bias of - 1/n. 

(c) The standard deviations of the simulated distributions are very 

-0 5 

close to the asyirptotic values of n^ * , although the lead coefficient 
in the regression on the variances, b^ = .743756, is not as close to 
the theoretical value of 1 as we would hope. When SIMTBED is repeated 
10 tines with 10 different batches of simulated data, we find an average 
value for b^ to be 1.0604, with a standard deviation for b^ of 
.307. The estimation procedure for b^, therefore, remains valid, but 
the estimate itself is highly variable. 

The agreement between the simulated and the theoretical, asymptotic 
values of the bias and variances was discovered previously by Cox (1966). 
SIMTBED has new allowed us to automatically look at a broader range of 
subsarrple sizes and to see, through boxplots and estimates of skewness 
and kurtosis, a fuller picture of any changes in the distribution of the 
estimator. We can be satisfied that estimates of serial correlation do 
behave approximately as Normal (-1/n, 1/n) random variables when the 
underlying data is Normal(0,l). 

If the lead terms in the expansions of the mean and variance of the 
estimated correlation coefficient (ie. a^, a^, and b^ ) had been 

unknown, we would also have a fairly good idea new of what they were. 



12 



When the underlying data is x^/ Figure 1(b) confirms Cox's obser- 
vation that the bias is relatively uneffected but, for small subsarrples, 

the standard deviation is smaller than the expected n . Unlike 
Figure 1(a), there is a pronounced skewness in the boxplots in Figure 1(b) 
with many more outliers at the positive end, and with the mean higher than 
the median at the first four subsarrple sizes. The problem of suppressed 
variance seems cured at n = 100 and n = lbO, but the skewness remains 

/ O 

and could cause problems in tests of significance. 

Figure 1(c), which is based on an underlying batch of simulated 
Lognormal (0,1) data, shews a slight exaggeration of the effects in 
Figure 1(b). The standard deviation is more suppressed and does not 
attain the theoretical level by n^ = 150. The positive skewness is more 
pronounced and kurtosis does not approach the theoretical value of 0. 

Overall, the effects of long-tailed data on the distribution of the 
serial correlation coefficient can be summarized as follcws: 

(i) Bias is not significantly affected and remains at approxi- 
mately -1/n. 

(ii) The variance of the distribution of the serial correlation 
coefficient is reduced by longer-tailed data. 

(iii) Positive skewness is created in the distribution. 

(iv) Kurtosis may become positive at large subsarrple sizes. 

(v) For long-tailed data (i.e. Lognormal), a subsample size of 
150 is not large enough to insure asymptotic Normality. 

2.4 SIMTBED Output for the z-Transform of the Correlation 

Fisher's z-transform of the estimated correlation coefficient is 
defined by: 
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z 

n 




1 + r 

n 

1 - 

n 



where is the estimated serial correlation presented in Section 2.1. 

The transformation is intended to make the distribution of the Z more 

n 

Normal than that of the r^ . When the same SIMTBED experiment described 

in Section 2.1 is run using Z as the estimator instead of r , we 

n n 

get the results shewn in Figures 2(a), 2(b) and 2(c). It should be noted 

that the scale of the boxplots here has been forced to be approximately 

conparable to the scale for the boxplots in Section 2.1. This is done 

by suppressing outliers that are more than 1.5 interquartile distances 

beyond the quartiles of the boxplot. If we had allowed the data to scale 

the boxplots, we would have seen a much wider range on the vertical axis 

because the Z are not restricted to the limits of -1 to +1 and be- 
n 

cause there is one far outlier at -3.8. In this type of "reduced 

graphics", we still see the number of outliers that fall beyond the 

allowable range through the numbers at the ends of the boxplots, but we 

do not see their actual locations. 

Figure 2(a) shows the distribution of the z-transformed correlation 

coefficients when we use simulated Normal (0,1) data. At each subsanple 

size, the mean and standard deviation are close to the theoretical 
-1 - 1/2 

n and n respectively. The skewness and kurtosis at subsanple 
size n^ = 10 are far from the theoretical Normal distribution values 

of 0 and 0, reflecting partly the one far outlier at -3.8 and partly 
the negative skew in the remainder of the Z^ 's. For other subsanple 

sizes, there is no strong evidence to contradict the assunption of 
approximate Normality. 

The relationship between Figure 2(b) and 2(a) is similar to that 

2 

between 1(b) and 1(a). Figure 2(b), which is based on sinulated 
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data, shews (a) bias that is the same as for the transformed correlations 
based on Normal data, (b) slightly suppressed variances, particularly at 
small subsample sizes and (c) positive skewness which persists at large 
subsarrple sizes. In addition, there are signs of positive kurtosis at 
small subsample sizes. 

Figure 2(c) is based on Lognormal (U,l) data and shews high values 
of skewness and kurtosis at almost all subsample sizes. Approximate 
Normality seems an unwarranted assumption. In fact, the kurtosis is 
converging very slowly to its asymptotic value of 0. 

In general, using the z-transform does not help with Normality 
assumptions, especially when dealing with long-tailed distributions. 

2.5 SIMTBED Output for the 2-Fold Jackknife of the Correlation 

The final Figures, 3(a), 3(b) and 3(c), deal with the 2-fold jack- 
knife estimate of correlation. Again, the figures are reduced graphics 
with scaling comparable to that of the boxplots of Sections 2.3 and 2.4. 
To define the estimator, we start with a given subsample of size n, 
compute the serial correlation for the first |ji/2j points and call it 
r^(n/2), compute the serial correlation for the second L.n/2J points and 
call it r 2 (n/2) and compute the serial correlation for the entire sub- 
sample of n points and call it r 0 (n). Bach conputation follcws the 
formula in Section 2.1. The three estimators are then combined to form 
two pseudo- values, 

r 1 *(n) = 2r Q (n) - r^n/2) 

and r 2 *(n) = 2r Q (n) - r 2 (n/2) , 

and the final jackknife estimator for that subsample is defined as 
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r(n) = 



r *(n) + r *(n) 

1 2 
2 

Although a jackknife estimator may have many favorable properties, we are 
concerned here primarily with its ability to remove bias, hopefully 
without inflating the variances of the estimator and/or inducing 
nonnormality. 

Figure 3(a), based on simulated Normal(0,l) data, shows nearly 
complete removal of bias, even at small subsample sizes. The cost of 
the bias reduction is reflected in an increase of nearly 50% in the 
standard deviation of the correlation estimate for subsample size 10, and 
lesser relative increases at larger subsample sizes. There is also an 
indication of a positive skew for small subsample sizes, and the problem 
that the jackknife estimator need not fall into the -1 to +1 range 

which is desirable for a correlation coefficient estimate. 

2 

When using simulated data as in Figure 3(b) , or simulated 

Lognormal (0,1) data as in Figure 3(c), there is again no problem with 
bias. Variance inflation, though it exists at small subsample sizes, 
is not as large as when Normal (0,1) data is used. The distributions of 
the jackknifed correlations shew very pronounced positive skews, however, 
as well as positive kurtosis. These two problems are worse for the 
longer-tailed Lognormal data. 

Overall, the jackknife estimator is very successful at removing bias 
but the costs include variance inflation, which can be severe at small 
subsanple sizes, plus increased positive skewness and kurtosis when the 
estimates are based on data from longer-tailed distributions. 

2.6 Comparison of the Three Estimates of Correlation 

For Normal(0,l) data, the distribution of the usual correlation 
coefficient displayed in Figure 1(a) behaves very much as theoretical 
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asynptotic calculations would predict, even at small subsairple sizes. 

This makes it possible to correct for bias in the estimator and to 
perform tests of significance. Use of Fisher's z-transform, as illus- 
trated in Figure 2(a) does not seem necessary since it does not 
significantly irrprove the approximate Normality of the estimator. The 
jackknife estimator in Figure 3(a) may be valuable if a direct, unbiased 
estimator is needed but the inflated variance of the jackknife estimator 
may limit the usefulness of the estimate as well as make any tests of 
significance too conservative. 

When the underlying data comes from a longer-tailed distribution, 
the usual correlation coefficient in Figures 1(b) and 1(c) retains a 
predictable bias term, although the variance of its distribution is 
slightly depressed and the skewness and kurtosis becomes positive, even 
for subsamples as large as 150. This means that it is still possible to 
estimate the correlation accurately, but tests of significance fall on 
shakey assumptions of Normality. The z-transform in Figures 2(b) and 
2(c) does little to firm up those assumptions and, in some cases, make^ 
the situation worse. As in the case of Normal data, the 2-fold jack- 
knifed correlation in Figures 3(b) and 3(c) is bias-free but follows a 
fairly nonnormal distribution which would invalidate significance 
testing. 

All of the preceding observations and conclusions flow immediately 
from the nine Figures presented so far. Further studies could easily be 
done through SIMTBED, looking at larger subsarrple sizes, correlated data, 
i.e. p * 0 and alternative marginal distributions. For demonstration 
purposes, though, it is better to proceed to our second application. 



3.0 Estimating the Shape Parameter for a Gamna Distribution 

As a second application of SIMTBED, we will consider a problem 
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which has received much less statistical attention; asynptotic results 
are summarized in Cox and Lewis (1966, Ch.3) and Johnson and Kotz 
(1970, Ch.17). We want to estimate the shape parameter, K, for a Garrma 
distribution, where the Ganma density is given by 
K K— 1 -Kx/u 

£(X) = TU) Q X > 0; K > 0; y > 0 

= 0 x < 0 . 

Notice that the mean of this distribution is y, not K/y as in some 
differently parameterized versions of the Gaitina density. For the data 
that will be simulated for use in SIIOTBED we will use K = 5 and y = 1 
and K = 0.25 and y = 1. The closer the mean of our estimate is to 5 
or 0.25, the better (in terms of bias) is our estimation procedure. 

Other factors such as the variance and Normality of the estimator will 
of course also have influence in the determination of a prefered 
estimator; the bias and variance could be combined into m.s.e. 

Section 3.1 will corrpare the commonly used maximum likelihood 
estimator which is mildly difficult to corrpute to the carpeting method 
of moments estimator which is very simple to corrpute. Both procedures 
result in asymptotically Normal estimators (Cramer, 1948) but the 
m. l.e. is usually prefered because of its favorable asymptotic relative 
efficiency (Cox and Lewis 1966). Through SIMTBED, though, we will see 
that for small subsamples the estimated variances of the two estimators 
of K are not as far apart as asynptotic results lead us to believe. 

In addition, the bias that appears in both estimators is smaller for the 
moment estimator. 

In Section 3.2 we will use a four-fold jackknife of both the m.l.e. 
and moments estimators to successfully remove the bias. What is remark- 
able is that unlike the jackknifing of the serial correlation, there is 
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little or no cost in terms of variance inflation and nonnomvality for the 
jackknifed moment estimator. When K = .25, we will see in Section 3.3 
that the jackknifed m.l.e. dominates the other three estimators at all 
subsample sizes when considering the mean, variance, and Normality of 
the estimator. 

3.1 Maximum Likelihood and Moment Estimators of K 

Figure 4(a) is very similar in format to the figures that have 
already been presented for the correlation exaiiple except that: 

(1) The estimator whose distribution is being displayed is the maximum 
likelihood estimator of K, the shape parameter of a Gaimna(5) popula- 
tion. We denote the estimator, corrputed from a simulated subsample of 
size n, by K(n) and define it to be the solution of the equation: 

„ . n n 

n [ log K(n)- Y(K(n))] = n log £ X j/ n - £ log , 

i=l i=l 

where the X^ are the simulated Gamma (5) random variables and f(.) is the 
digamma function (Cox and Lewis, 1966). 

(2) The eight subsanple sizes which we will be looking at are n = 33, 

n„ = 50, n 0 = 71, n. = 100, n c = 125, n c = 166, n., = 250 and 

n g = 500. Note the difference between the n.'s in the previous exanple 
and these n^'s. Since these are larger we will not see rruch small 
sample detail, but we will see soma of the asynptotic (n = 500) effects 
coming in. 

(3) At each subsample size we will work with M* = 20 independent 
replications of N* = 2500 simulated Gamma (5) random variables, 
instead of the M = 10 replications of N = 5000 variables used previously. 
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The total number of independent simulated random variables across 
replications remains constant at the program maxinum of 50, 0U0. Hence, 
the boxplot at subsample size 50 in Figure 4(a) represents the distribu- 
tion of M* [_N*/5Uj = 1000 estimates of K(50) just as the boxplot at 
subsanple size 50 in Figure 1(a) represents the distribution of 
M [_N/50j = 1000 estimates of r(50). As long as the product, M x N, 
remains constant, the only effect of changing the number of replica- 
tions is, up to rounding in N/n to change the results in the 
regression on the averages. By using M* = 20 and N* = 2500, SIMTBFO 
reports regression coefficients averaged over 20 replications, but, 
within each replication, the dependent variables are averages over just 



(4) The boxplots are presented using the reduced graphics option. In 
this option any extreme outliers (i.e. those beyond 1.5 interquartile 
distances) are included as a count at the tail of each boxplot. This 
option was chosen in order to give more graphical weight to the body of 
the distributions and the fall-off in the bias. Limited printer resolu- 
tion makes it inpossible to show details in the body and the tails of the 
distributions if there are many straggling outliers. In the case of very 
extreme outliers, no detail would be seen in the body of the boxplot 
without the reduced graphics option. 

Figure 4(b) looks at the distribution of the moment estimator of K, 
the shape parameter of a Ganma(K) population: 



simulated Gamma(5) random variables. The SIMTBED options and para- 




values of the estimator. 



n 



K(n) = (n-1) X 2 / l (X. - X) 2 

i=l 1 



n 

where X = £ 

i=l 



X^/n , n is the subsample size, and the X^ are the 
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meters mentioned in (2), (3) and (4) preceding are also in effect here. 

The two figures, 4(a) and 4(b), shew a very pronounced bias in both 
estimation procedures, although the moment estimator is slightly closer 
to the unbiased value of 5. As expected, the standard deviation of the 
m. l.e. is lower than that of the moment estimator although the relative 
difference at small subsample sizes, for instance 1.448 versus 1.482 at 
n^ = 33, may not outweigh the increase in bias with the m.l.e. At larger 

subsample sizes, the relative difference is close to the theoretical 
asymptotic relative efficiency of .78 (i.e. .91 at n^ = 250) 

Both estimators also shew distributions with positive skewness and 
kurtosis that decrease to the asymptotic 0 levels as subsample size 
increases. The asymptotics appear to take hold more quickly for the 
moment estimator than for the m.l.e. 

In summary, SIMIBED shews that the m.l.e. is indeed better than 
the moment estimator in terms of variance, but not as good for small 
sample sizes as asymptotic results would lead us to believe. In the 
other areas of bias and asymptotic Normality, the moment estimator would 
have to be preferred. 

3.2 4-Fold Jackknifed Estimators of K 

Figures 4(c) and 4(d) show the distributions of the 4-fold jack- 
knifed m.l.e. of K and 4-fold jackknifed moment estimator of K, 
respectively. A 4-fold jackknife estimator is similar to the 2-fold 
jackknife estimator described in Section 2.5 except that there are 4 
pseudo-values that come out of dividing each subsanple into fourths. 

More details can be found in Mosteller and Tukey (1977). 

The purpose of the jackknife is to remove the conspicuous bias 
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observed in Figures 4(a) and 4(b). This goal is seen to be acconplished 
in Figure 4(c) and 4(d) and we can also note smaller values of skewness 
and kurtosis, indicating a quicker approach to asymptotic Normality. The 
skewness and kurtosis of the jackknifed moment estimator are the lowest, 
at small subsanple sizes, among all estimators. The variance of the 
jackknifed moment estimator is also only slightly inflated, as is the 
variance of the jackknifed m.l.e.. 

All told, the jackknifed moment estimator, because of its lack of 
bias, small- variance, and low skewness and kurtosis, would be the method 
of choice if estimation of K or significance testing was the goal. 

3.3 Results for K = 0.25 

In Figures 5(a), 5(b), 5(c) and 5(d) we show similar results to 
those discussed above for the case K = 5.0, but using K = 0.25. The 
fact (Cox and Lewis, 1966, Ch.3) that the m.l.e. estimate is much more 
efficient than the moment estimate is graphically illustrated. What is 
new is the effect of jackknifing: bias is reduced without the sacrifice 

of variance inflation or nonnormality. 

Further comparisons and interpretations are similar to those done 
for the case K = 5.0, and are left to the reader. 

4.0 Conclusions 

Sinply by providing SIMTBED with the desired estimators, we have 
been able (a) to explore in depth the effects of changes in data distri- 
bution and of different estimation procedures on the calculation of the 
serial correlation coefficient, and (b) to compare four different ways to 
estimate the shape parameter in a highly skewed Gamma population. 
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The graphics and numerical output combine to let us see and quantify 
distributional changes that occur as subsample size grows. We can see 
bias fall away, variance shrink, and skewness disappear as the estimator 
approaches asynptotic Normality. Terms in the asynptotic expansion of 
the mean and variance of the estimator are automatically calculated and 
can be used to corrpare different estimators. 

Ease of use and portability, hcwever, remain as SIMTBED's most 
important features, and will hopefully inspire users to try more diverse 
and extensive simulation experiments. 

Other graphical displays besides running boxplots can be used; some 
alternatives are given in Lewis (1972), Heidelberger and Lewis (1981) and 
Devlin, Gnanadesikan and Kettenring (1981). 

5.U Availability 

SIMTBED is at present only available in a version run on the 
IBM 3033. Hcwever, since the program uses standard FORTRAN and is 
independent of any software packages, conversion should be sinple. 
Versions for VAX machines will be tested shortly. 
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SAMFLE SIZE (N): 5000 NO. OF REPLICATIONS <M) s 10 DEGREE OF REGRESSION <D> 
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SAMPLE SIZE INI: 5000 NO. OF REPLICATIONS (Ml: 10 DEGREE Of REGRESSION ID) 
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SAMPLE SIZE ( N 1 : 5000 NO. OF REPLICATIONS <M>: 10 CEGREE CF REGRESSION <01 
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VERTICAL SCALE: YMIN = -0.9890 

YMAX = 0.7503 

ESTIMATOR: ESTIMATES OF THE Z-T RANSF C RM OF THE SERIAL CORRELATION COEFFICIENT FOP A IGGNCRM AL < 0 . 1 ) SAMPLE 



SAMPLE SIZE 5000 NO. OF REPLICATIONS (M) : 10 OEGREE OF REGRESSION (0) 
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ESTIMATOR: ESTIMATES OF THE 2-FOLC JACKNIFED SERIAL CORRELATION COEFFICIENT FOP A NOR MAL ( 0 * I ) SAMPLE 



SAMPLE SIZE IN): 5000 NO. OF REPLICATIONS <M» : 10 CEGREE OF REGRESSION (0) 
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VERTICAL SCALE: YMIN = -1.2072 

YMAX = 1.2114 

ESTIMATOR: ESTIMATES OF THE 2-FCLC J7CKNIFED SERIAL CORRELATION COEFFICIENT FOR A CHI - SQU AR E ( 1 ) SAMPLE 



SAMPLE SIZE IN): 5000 NO. OF REPLICATIONS <M): 10 CEGREE Gf REGRESSION (0) 
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ESTIMATOR: ESTIMATES OF THE 2-FCLC JACKNIFEO SERIAL CORRELATION COEFFICIENT f 0R A L OGNOF M AL ( C ♦ 1 I SAMPLE 



SAMPLE SIZE (N I : 2500 NO. 0^ REPLICATIONS 1M: 20 OEGREE OF REGRESSION (D) 
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REGRESSION ON VARIANCE - CCE FF 1 ( I EN T S : 116.458 -1681.^6 14121.1 -34565 



_ __N 0 . OF REPLICATIONS < M > : 2 C DEGREE cr PEGFESSICN ( 01 : 
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SAMPLE SIZE ( N I : 250C NO. OF REPLICATIONS ( M ) : 20 DEGREE OF REGRESSION ( D ) 
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ESTIMATCF: 4-FCID JACKMFEC MAXIMUM LlKELlFOOD ESTIMATE OF THE SHAPE PARAMETER Of THE GAMMA f I S TR I R UT I CN 
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SAMPLE SIZE <N): 2500 NO. OF R FPL I CA T 1 DN S (M»: 20 DEGP r E OF PEGPFSMON ( C > 
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